trigger ReplayLog fallback only after candidate no-connection timeout #459
Conversation
Walkthrough

Adds per-node-group replay tracking to the log replay service: introduces a CandidateReplayWatch struct and a map to track per-node-group term and timestamps, reduces the notify timeout from 10s to 5s, and changes the timeout handler to apply interval-based gating and avoid redundant ReplayLog requests when inbound recovery streams exist.
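The per-node-group watch state and the interval-based gating described in the walkthrough can be sketched as follows. The struct fields and the `should_request_replay` helper are illustrative assumptions, not the actual code in `tx_service/src/fault/log_replay_service.cpp`:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the per-node-group replay-tracking state.
struct CandidateReplayWatch
{
    int64_t candidate_term{-1};  // term observed for the candidate leader
    uint64_t first_seen_ts{0};   // when the no-connection state was first seen
    uint64_t last_replay_ts{0};  // when ReplayLog was last requested
};

// Decide whether a ReplayLog request should be sent for one node group:
// only after the no-connection timeout has elapsed, and no more often than
// once per replay interval (the gating the walkthrough describes).
bool should_request_replay(const CandidateReplayWatch &watch,
                           uint64_t now_ts,
                           uint64_t no_conn_timeout,
                           uint64_t replay_interval)
{
    if (now_ts - watch.first_seen_ts < no_conn_timeout)
    {
        return false;  // candidate has not been disconnected long enough
    }
    return now_ts - watch.last_replay_ts >= replay_interval;
}
```

In the actual service, one such watch would be kept per node group in a map and consulted by the timeout handler.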
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (warnings)
Actionable comments posted: 1
🧹 Nitpick comments (1)
tx_service/src/fault/log_replay_service.cpp (1)
129-144: Avoid scanning every stream under inbound_mux_ once per node group.

This new check is O(node_groups × streams) every timeout cycle because inbound_connections_ is keyed by StreamId. On larger clusters, that can turn the fallback path into a lock-heavy poller and delay Connect/on_closed. A small secondary index or per-<cc_ng_id, cc_ng_term> recovering-stream count would make this much cheaper.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tx_service/src/fault/log_replay_service.cpp` around lines 129-144: the loop scanning inbound_connections_ under inbound_mux_ for every node group timeout is inefficient. Add a secondary index (e.g., an unordered_map or nested map keyed by std::pair<cc_ng_id, cc_ng_term> or a struct key) that tracks the count of recovering streams, avoiding O(node_groups × streams) scans. Maintain this index whenever entries are inserted or erased, or when an entry's recovering_ flag changes (update in the same critical sections that currently touch inbound_connections_), then replace the scan in the has_replay_connection check with a constant-time lookup against the new map. Keep using inbound_mux_ for synchronization and update the index in the same protected scope as inbound_connections_ to preserve correctness.
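One way to realize the suggested secondary index is a per-`<cc_ng_id, cc_ng_term>` count of recovering streams. The key type, hash, and function names below are hypothetical sketches; in the real service the update functions would run under inbound_mux_, in the same critical sections that mutate inbound_connections_:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical composite key for a <node group id, term> pair.
struct NgTermKey
{
    uint32_t ng_id;
    int64_t term;
    bool operator==(const NgTermKey &o) const
    {
        return ng_id == o.ng_id && term == o.term;
    }
};

struct NgTermKeyHash
{
    size_t operator()(const NgTermKey &k) const
    {
        return std::hash<uint64_t>{}((static_cast<uint64_t>(k.ng_id) << 32) ^
                                     static_cast<uint64_t>(k.term));
    }
};

// Count of recovering inbound streams per <ng_id, term>. Maintained on the
// same code paths that insert/erase streams or flip their recovering flag.
std::unordered_map<NgTermKey, size_t, NgTermKeyHash> recovering_counts;

void on_stream_recovering(uint32_t ng_id, int64_t term)
{
    ++recovering_counts[{ng_id, term}];
}

void on_stream_closed(uint32_t ng_id, int64_t term)
{
    auto it = recovering_counts.find({ng_id, term});
    if (it != recovering_counts.end() && --it->second == 0)
    {
        recovering_counts.erase(it);
    }
}

// Constant-time replacement for the per-timeout scan of every inbound stream.
bool has_replay_connection(uint32_t ng_id, int64_t term)
{
    return recovering_counts.count({ng_id, term}) > 0;
}
```

This trades a small amount of bookkeeping on the stream insert/erase paths for an O(1) lookup in the timeout handler, which is the hot path the nitpick is concerned about.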
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tx_service/src/fault/log_replay_service.cpp`:
- Around lines 145-167: The scan loop should reset the "no-connection" timer when
a healthy replay stream exists: when has_replay_connection is true, update
watch.first_seen_ts (and/or clear it) to the current timestamp (now_ts) so
first_seen_ts reflects the most recent connected time rather than the original
observation; this prevents immediately enqueuing ReplayLog (ReplayLog(ng_id,
candidate_term, ...)) on the next scan after a short-lived drop. Locate the
check using has_replay_connection, watch.first_seen_ts, watch.last_replay_ts,
and ReplayLog and add logic to set watch.first_seen_ts = now_ts when a replay
stream is present.
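A minimal sketch of the suggested timer reset, assuming a watch struct with a first_seen_ts field (struct and function names are hypothetical):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical slice of the per-node-group watch state.
struct CandidateWatch
{
    uint64_t first_seen_ts{0};  // start of the current no-connection window
};

// While a healthy replay stream exists, advance first_seen_ts so the
// no-connection timeout is measured from the last connected moment rather
// than the original observation. This prevents an immediate ReplayLog
// enqueue on the first scan after a short-lived connection drop.
void update_watch(CandidateWatch &watch,
                  bool has_replay_connection,
                  uint64_t now_ts)
{
    if (has_replay_connection)
    {
        watch.first_seen_ts = now_ts;  // timer restarts after a drop
    }
}
```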
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 09db1326-0dda-43d4-8e1e-30ecf2609d85
📒 Files selected for processing (2)
tx_service/include/fault/log_replay_service.h
tx_service/src/fault/log_replay_service.cpp
Here are some reminders before you submit the pull request:
- Reference the issue this fixes: fixes eloqdb/tx_service#issue_id
- Run the test suite: ./mtr --suite=mono_main,mono_multi,mono_basic